Phoneme-based Statistical Transliteration of Foreign Names for OOV Problem

Author

  • GAO Wei
Abstract

Given a source-language term, machine transliteration automatically generates its phonetic equivalent in a target language. It is useful in many cross-language applications. Interest in automatic transliteration has grown recently, especially for language pairs whose phonetic representations differ significantly, such as English and Chinese. Despite the many cross-language applications involving English and Chinese, machine transliteration between the two languages has not been studied comprehensively. Existing English-Chinese transliteration techniques are typically based on the source-channel framework, e.g. the IBM SMT model, and the accuracy of this model is rather low. In this thesis, we propose a direct approach to English-to-Chinese transliteration, with two direct transliteration models. In the first model, we treat the problem as a direct phonetic mapping from English phonemes to a set of rudimentary Chinese phonetic symbols plus mapping units discovered dynamically during training. An effective algorithm for aligning phoneme chunks is presented. In the second model, the contextual features of each phoneme are taken into account by means of the Maximum Entropy formalism, and the model is further refined with the precise alignment scheme based on phoneme chunks. We compared the direct approaches with a source-channel baseline implemented with the IBM SMT model and showed that the second approach is significantly superior.
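
The contrast between the source-channel and direct formulations can be sketched in generic notation. This is the textbook form of the two frameworks and of a Maximum Entropy classifier, not necessarily the exact factorization used in the thesis. The symbols are assumptions introduced here for illustration: E is the English phoneme sequence, C a candidate Chinese phonetic symbol sequence, x the context of a single phoneme, c a candidate target unit, f_i feature functions, and \lambda_i their weights.

```latex
% Source-channel (noisy-channel) decoding: apply Bayes' rule and
% search with a channel model P(E | C) and a target-side model P(C).
\hat{C} = \arg\max_{C} P(C \mid E) = \arg\max_{C} P(E \mid C)\, P(C)

% Direct decoding: estimate the conditional distribution itself,
% without the Bayes inversion.
\hat{C} = \arg\max_{C} P(C \mid E)

% Generic Maximum Entropy (log-linear) form for one decision,
% conditioning on contextual features of the current phoneme.
P(c \mid x) = \frac{\exp\left(\sum_{i} \lambda_i f_i(x, c)\right)}
                   {\sum_{c'} \exp\left(\sum_{i} \lambda_i f_i(x, c')\right)}
```

In the log-linear form, contextual information enters only through the feature functions f_i(x, c), which is one common way to let neighboring phonemes influence each mapping decision.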

Similar resources

Phoneme-Based Transliteration of Foreign Names for OOV Problem

One problem seriously affecting CLIR performance is the processing of queries with embedded foreign names. A proper noun dictionary is never complete, rendering name translation from English to Chinese ineffective. One way to solve this problem is not to rely on a dictionary alone but to adopt automatic translation according to pronunciation similarities, i.e. to map phonemes comprising an Engli...

Language Independent Transliteration System Using Phrase-based SMT Approach on Substrings

Every day the newswire introduces events from all over the world, highlighting new names of persons, locations and organizations with different origins. These names appear as Out of Vocabulary (OOV) words for machine translation, cross-lingual information retrieval, and many other NLP applications. One way to deal with OOV words is to transliterate the unknown words, that is, to render them in th...

Optimizing Transliteration for Hindi/Marathi to English Using only Two Weights

Machine transliteration has received significant research attention in the last two decades. Hindi-to-English and Marathi-to-English named entity machine transliteration, however, is comparatively less studied. Currently, research work in this domain is carried out using grapheme-based statistical approaches. But, to achieve better accuracy for the transliteration, an adequate bilingual t...

Hindi-to-Urdu Machine Translation through Transliteration

We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context, whereas in previous work transliterat...

Extracting English-Korean Transliteration Equivalence from Domain-Specific Dictionaries

Automatic translation knowledge acquisition or automatic bilingual dictionary construction has become an important first step for natural language applications such as machine translation and cross-language information retrieval. Transliterations are used to translate proper names and technical terms, especially from languages in Roman alphabets to languages in non-Roman alphabets such as from E...

Publication date: 2004